Service Monitoring

Guide to building a basic monitoring stack for self-hosted services and infrastructure

created: 2026-03-14 · updated: 2026-03-14 · #monitoring #self-hosting #observability

Introduction

Monitoring turns a self-hosted environment from a collection of services into an operable system. At minimum, that means collecting metrics, checking service availability, and alerting on failures that need human action.

Purpose

This guide focuses on:

  • Host and service metrics
  • Uptime checks
  • Dashboards and alerting
  • Monitoring coverage for common homelab services

Architecture Overview

A small monitoring stack often includes:

  • Prometheus for scraping metrics
  • Exporters such as node_exporter for host metrics
  • blackbox_exporter (or similar) for probing endpoint availability
  • Grafana for dashboards
  • Alertmanager for notifications

Typical flow:

Exporter or target -> Prometheus -> Grafana dashboards
Prometheus alerts -> Alertmanager -> notification channel
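The notification leg of that flow is configured in Alertmanager. A minimal alertmanager.yml sketch, assuming a generic webhook receiver (the URL and hostname are placeholders):

```yaml
# alertmanager.yml - minimal routing sketch; webhook URL is a placeholder
route:
  receiver: default
  group_by: [alertname]      # batch related alerts into one notification
  group_wait: 30s            # wait briefly so related alerts arrive together
  repeat_interval: 4h        # re-notify unresolved alerts every 4 hours

receivers:
  - name: default
    webhook_configs:
      - url: "http://notifier.internal.example:5001/alerts"
```

In practice the webhook receiver is often swapped for email, Slack, or another channel Alertmanager supports; the routing structure stays the same.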

Step-by-Step Guide

1. Start with host metrics

Install node_exporter directly on important Linux hosts, or run it in a container that is given access to the host's filesystems and namespaces so it reports host-level rather than container-level metrics.
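If you run it containerized, the upstream image expects the host root filesystem mounted read-only and host networking/PID namespaces enabled. A Compose sketch based on the pattern documented for the official image:

```yaml
# docker-compose.yml sketch for node_exporter in a container
services:
  node_exporter:
    image: prom/node-exporter:latest
    network_mode: host    # expose host network metrics and listen on host :9100
    pid: host             # see host processes, not just the container's
    volumes:
      - /:/host:ro,rslave # host root filesystem, read-only
    command:
      - --path.rootfs=/host
```

Without the host mounts, disk and filesystem metrics describe the container image, which is rarely what you want.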

2. Scrape targets from Prometheus

Example scrape config:

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "server-01.internal.example:9100"
          - "server-02.internal.example:9100"

3. Add endpoint checks

Use a blackbox probe or equivalent to test HTTPS and TCP reachability for user-facing services.
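With blackbox_exporter, Prometheus scrapes the exporter itself and passes each real target as a URL parameter, which requires a small relabeling block. A sketch of the standard pattern (module name, hostnames, and ports are placeholders for your setup):

```yaml
scrape_configs:
  - job_name: blackbox-https
    metrics_path: /probe
    params:
      module: [http_2xx]          # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - "https://app.internal.example"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # the URL to probe
      - source_labels: [__param_target]
        target_label: instance         # keep the probed URL as the instance label
      - target_label: __address__
        replacement: "blackbox.internal.example:9115"  # scrape the exporter itself
```

The relabeling is easy to get wrong; `probe_success` per target is the first series to check after adding a job like this.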

4. Add dashboards and alerts

Alert only on conditions that require action, such as:

  • Host down
  • Disk nearly full
  • Backup job missing
  • TLS certificate near expiry
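For dashboards, Grafana can pick up Prometheus as a data source through file provisioning rather than manual UI setup. A minimal provisioning sketch (the file path and URL are assumptions for a typical install):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy    # Grafana backend queries Prometheus on the browser's behalf
    url: http://prometheus.internal.example:9090
    isDefault: true
```

Provisioned data sources survive container rebuilds, which makes the stack easier to recreate from configuration alone.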

Configuration Example

Example alert concept:

groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
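The other conditions listed earlier follow the same rule shape. Sketches for disk usage and certificate expiry, assuming node_exporter and blackbox_exporter metrics are being scraped; the thresholds (10%, 14 days) are illustrative:

```yaml
groups:
  - name: infrastructure-extra
    rules:
      - alert: DiskAlmostFull
        # free space below 10% on real filesystems (ignore tmpfs/overlay)
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes < 0.10
        for: 30m
        labels:
          severity: warning
      - alert: TLSCertExpiringSoon
        # blackbox_exporter exposes earliest cert expiry as a Unix timestamp
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
        for: 1h
        labels:
          severity: warning
```

A "backup job missing" alert usually needs the backup job to push a completion timestamp (for example via a Pushgateway or textfile collector) so an absence can be detected.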

Troubleshooting Tips

Metrics are missing for one host

  • Check exporter health on that host
  • Confirm firewall rules allow scraping
  • Verify the target name and port in the Prometheus config
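When a target is down, fetching its /metrics endpoint by hand (for example with curl) and inspecting a few series often isolates the problem faster than the Prometheus UI. A small Python sketch of a parser for the text exposition format; it is a simplification (no timestamps, no escaped spaces in label values), and the sample payload is made up:

```python
def parse_metrics(text):
    """Parse Prometheus text exposition format into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE/comment lines
            continue
        # split on the last space: everything before is the series name+labels
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Hypothetical payload, as fetched from http://server-01.internal.example:9100/metrics
payload = """\
# HELP node_filesystem_avail_bytes Filesystem space available.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{mountpoint="/"} 5.3e+10
node_load1 0.42
"""

metrics = parse_metrics(payload)
print(metrics["node_load1"])  # 0.42
```

This is only a debugging aid; if the endpoint returns valid text but Prometheus still shows the target down, the problem is usually networking or a config mismatch rather than the exporter.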

Alerts are noisy

  • Add for durations to alert rules so they fire only when a condition persists, not on short blips
  • Remove alerts that never trigger action
  • Tune thresholds per service class rather than globally

Dashboards look healthy while the service is down

  • Add blackbox checks in addition to internal metrics
  • Monitor the reverse proxy or external entry point, not only the app process
  • Track backups and certificate expiry separately from CPU and RAM

Best Practices

  • Monitor the services users depend on, not only the hosts they run on
  • Keep alert volume low enough that alerts remain meaningful
  • Document the owner and response path for each critical alert
  • Treat backup freshness and certificate expiry as first-class signals
  • Start simple, then add coverage where operational pain justifies it
